Before beginning, source all the required dependencies.

source('src/lib.R')

The data you will be working with

Let’s have a look

get_full_dataset() %>%
  group_by(type) %>%
  summarise(mean_x = mean(x),
            sd_x = sd(x),
            mean_y = mean(y),
            sd_y = sd(y))
## # A tibble: 4 x 5
##   type    mean_x  sd_x mean_y  sd_y
##   <chr>    <dbl> <dbl>  <dbl> <dbl>
## 1 circles  0.518 0.245  0.497 0.245
## 2 linear   0.518 0.295  0.483 0.301
## 3 normal   0.480 0.297  0.484 0.305
## 4 spirals  0.518 0.254  0.491 0.245

Pretty insightful, isn’t it?

Indeed, most of the time, summary statistics tell you very little about the true nature of the data you have.
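The same effect shows up in Anscombe's quartet, a classic example that ships with base R as `anscombe`. This is only a side illustration on a different, built-in dataset (not the data used in this tutorial), and the name `quartet_stats` is made up for the sketch: four tiny datasets with nearly identical means and standard deviations that look nothing alike once plotted.

```r
# Anscombe's quartet is built into R (datasets::anscombe):
# columns x1..x4 and y1..y4 hold four small datasets.
quartet_stats <- data.frame(
  set    = 1:4,
  mean_x = sapply(1:4, function(i) mean(anscombe[[paste0("x", i)]])),
  sd_x   = sapply(1:4, function(i) sd(anscombe[[paste0("x", i)]])),
  mean_y = sapply(1:4, function(i) mean(anscombe[[paste0("y", i)]])),
  sd_y   = sapply(1:4, function(i) sd(anscombe[[paste0("y", i)]]))
)

# The four rows come out almost the same...
print(round(quartet_stats, 2))

# ...but the four scatter plots do not.
old_par <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i))
}
par(old_par)
```

The summary table alone would suggest the four sets are interchangeable; only the plots reveal the difference.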


Ok, let’s get serious and plot something more meaningful.

get_full_dataset() %>%
  ggplot(aes(x = x, y = y, color = class)) +
  geom_point() +
  facet_wrap(~type) +
  scale_color_fivethirtyeight() +
  theme_fivethirtyeight() +
  labs(color = "") +
  theme(legend.position = "none")

Woooooo!!1!11!!


Let’s fit!

Yes but how?

The cool thing about doing data science with a scripting language is that you need to be neither a computer scientist nor a statistician to make something.

library(caret)

Of course, knowing the theoretical underpinnings can be helpful, but what you really need is to know which approach best suits your problem… and you are done.

You won’t win a Kaggle competition, but you will get somewhere.
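To make "pick an approach and you are done" a bit more concrete, here is a minimal sketch of a fit with `caret`. Everything about the data is an assumption: `toy` is a made-up stand-in (two concentric rings with `x`, `y` and `class` columns), not the output of `get_full_dataset()`, and k-nearest neighbours is just one approach that happens to suit ring-shaped classes.

```r
library(caret)

set.seed(42)

# Hypothetical toy data: two concentric rings centred on (0.5, 0.5).
n <- 200
angle  <- runif(2 * n, 0, 2 * pi)
radius <- c(rnorm(n, 0.15, 0.02), rnorm(n, 0.40, 0.02))
toy <- data.frame(
  x     = 0.5 + radius * cos(angle),
  y     = 0.5 + radius * sin(angle),
  class = factor(rep(c("inner", "outer"), each = n))
)

# k-nearest neighbours with 5-fold cross-validation;
# caret tries several values of k and keeps the best one.
fit <- train(
  class ~ x + y,
  data       = toy,
  method     = "knn",
  trControl  = trainControl(method = "cv", number = 5),
  tuneLength = 5
)

print(fit)
```

Swapping `method = "knn"` for another model name is usually all it takes to try a different approach, which is most of the appeal here.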